{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "38cbf625",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/query_engine/polars_query_engine.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b63bbc7b",
   "metadata": {},
   "source": [
    "# Polars Query Engine\n",
    "\n",
    "This guide shows you how to use our `PolarsQueryEngine`, which converts natural language queries into Polars Python code using an LLM.\n",
    "\n",
    "The input to the `PolarsQueryEngine` is a Polars DataFrame, and the output is a response. The LLM infers the DataFrame operations needed to retrieve the result.\n",
    "\n",
    "**WARNING:** This tool gives the LLM access to the `eval` function, so arbitrary code execution is possible on the machine running it.\n",
    "While some filtering is applied to the generated code, this tool is not recommended\n",
    "for use in a production setting without heavy sandboxing or virtual machines.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a03b075",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9aa321b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index llama-index-experimental polars"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2b92b92",
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import sys\n",
    "from IPython.display import Markdown, display\n",
    "\n",
    "import polars as pl\n",
    "from llama_index.experimental.query_engine import PolarsQueryEngine\n",
    "\n",
    "\n",
    "logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
    "logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45863f52",
   "metadata": {},
   "source": [
    "## Let's start on a Toy DataFrame\n",
    "\n",
    "Here let's load a very simple dataframe containing city and population pairs, and run the `PolarsQueryEngine` on it.\n",
    "\n",
    "By setting `verbose=True` we can see the intermediate generated instructions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7cbf8631",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test on some sample data\n",
    "df = pl.DataFrame(\n",
    "    {\n",
    "        \"city\": [\"Toronto\", \"Tokyo\", \"Berlin\"],\n",
    "        \"population\": [2930000, 13960000, 3645000],\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25af48ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = PolarsQueryEngine(df=df, verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c72fdab",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\n",
    "    \"What is the city with the highest population?\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef723808",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7a72809",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect the generated Polars Python instructions\n",
    "print(response.metadata[\"polars_instruction_str\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd6e040b",
   "metadata": {},
   "source": [
    "We can also have the LLM synthesize a natural language response from the query result by setting `synthesize_response=True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83dfb496",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = PolarsQueryEngine(df=df, verbose=True, synthesize_response=True)\n",
    "response = query_engine.query(\n",
    "    \"What is the city with the highest population? Give both the city and population\",\n",
    ")\n",
    "print(str(response))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f235e9b",
   "metadata": {},
   "source": [
    "## Analyzing the Titanic Dataset\n",
    "\n",
    "The Titanic dataset is one of the most popular tabular datasets used in introductory machine learning.\n",
    "\n",
    "Source: https://www.kaggle.com/c/titanic"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31bee625",
   "metadata": {},
   "source": [
    "#### Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4679fb4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69fb308a",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pl.read_csv(\"./titanic_train.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f64c947a",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = PolarsQueryEngine(df=df, verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c40048f",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(\n",
    "    \"What is the correlation between survival and age?\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d7b59b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "539a0c05",
   "metadata": {},
   "source": [
    "## Additional Steps\n",
    "\n",
    "### Analyzing / Modifying prompts\n",
    "\n",
    "Let's look at the prompts! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1e53f59",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import PromptTemplate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1750a95",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = PolarsQueryEngine(df=df, verbose=True)\n",
    "prompts = query_engine.get_prompts()\n",
    "print(prompts[\"polars_prompt\"].template)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26a0681c",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(prompts[\"response_synthesis_prompt\"].template)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92999815",
   "metadata": {},
   "source": [
    "You can update prompts as well:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dfce5308",
   "metadata": {},
   "outputs": [],
   "source": [
    "new_prompt = PromptTemplate(\n",
    "    \"\"\"\\\n",
    "You are working with a polars dataframe in Python.\n",
    "The name of the dataframe is `df`.\n",
    "This is the result of `print(df.head())`:\n",
    "{df_str}\n",
    "\n",
    "Follow these instructions:\n",
    "{instruction_str}\n",
    "Query: {query_str}\n",
    "\n",
    "Expression: \"\"\"\n",
    ")\n",
    "\n",
    "query_engine.update_prompts({\"polars_prompt\": new_prompt})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb8a5357",
   "metadata": {},
   "source": [
    "This is the instruction string, which you can customize by passing `instruction_str` on initialization:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83505248",
   "metadata": {},
   "outputs": [],
   "source": [
    "instruction_str = \"\"\"\\\n",
    "1. Convert the query to executable Python code using Polars.\n",
    "2. The final line of code should be a Python expression that can be called with the `eval()` function.\n",
    "3. The code should represent a solution to the query.\n",
    "4. PRINT ONLY THE EXPRESSION.\n",
    "5. Do not quote the expression.\n",
    "\"\"\""
   ]
  },
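  {
   "cell_type": "markdown",
   "id": "7f3d9c10",
   "metadata": {},
   "source": [
    "As a quick sketch (reusing the `instruction_str` defined above; running queries still assumes an LLM is configured), the custom instruction string can be passed straight to the engine:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f3d9c11",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Construct the engine with the custom instruction string\n",
    "query_engine = PolarsQueryEngine(\n",
    "    df=df, instruction_str=instruction_str, verbose=True\n",
    ")"
   ]
  },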
  {
   "cell_type": "markdown",
   "id": "42394538",
   "metadata": {},
   "source": [
    "### Implementing Query Engine using Query Pipeline Syntax\n",
    "\n",
    "If you want to construct your own Polars query engine using our Query Pipeline syntax and the prompt components above, you can adapt the techniques from our Pandas Query Pipeline tutorial.\n",
    "\n",
    "[Setting up a Pandas DataFrame query engine with Query Pipelines](https://docs.llamaindex.ai/en/stable/examples/pipeline/query_pipeline_pandas.html)"
   ]
  }
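  ,
  {
   "cell_type": "markdown",
   "id": "8e4a2b20",
   "metadata": {},
   "source": [
    "As a minimal sketch of that approach (reusing `new_prompt` and `instruction_str` from above; the model choice and an available OpenAI API key are assumptions of this example), you can chain the prompt into an LLM with a `QueryPipeline` and treat the LLM output as the Polars expression to evaluate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e4a2b21",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.query_pipeline import QueryPipeline\n",
    "from llama_index.llms.openai import OpenAI\n",
    "\n",
    "# Bind the dataframe preview and instructions into the prompt,\n",
    "# leaving only `query_str` to be filled in at run time\n",
    "polars_prompt = new_prompt.partial_format(\n",
    "    instruction_str=instruction_str, df_str=str(df.head())\n",
    ")\n",
    "\n",
    "llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
    "qp = QueryPipeline(chain=[polars_prompt, llm], verbose=True)\n",
    "\n",
    "# The pipeline output is the raw Polars expression generated by the LLM;\n",
    "# it could then be run with eval(), as the query engine does internally\n",
    "output = qp.run(query_str=\"What is the correlation between survival and age?\")\n",
    "print(str(output))"
   ]
  }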
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
